Same structures hold for mechanistic queries and non-mechanistic clues
The estimand is:
We often seek the probability of causation. In potential outcomes notation: \(\Pr(Y(0)=0 \mid Y(1)=1)\).
Arguably this is an answer rather than an estimand. Still, it is our focus: we seek the answer that is defensible given the data and discuss the identifiability of this quantity.
Say that we have lots of data from a randomized experiment and we know that the average effect of \(X\) on \(Y\) is 1/3. In particular, we have infinite data supporting the following conditional distribution of \(Y\) given interventions on \(X\):
| | Y = 0 | Y = 1 |
|---|---|---|
| \(X=0\) | 2/3 | 1/3 |
| \(X=1\) | 1/3 | 2/3 |
What is the probability that \(X\) caused \(Y\) for a case from this population? (an “exchangeable” case)
From these data alone, either of the following distributions of potential outcomes is possible:

- 2/3 of cases have \(Y(0)=0, Y(1)=1\) (positive effects) and 1/3 have \(Y(0)=1, Y(1)=0\) (negative effects)
- 1/3 of cases have \(Y(0)=0, Y(1)=1\), 1/3 have \(Y(0)=Y(1)=1\), and 1/3 have \(Y(0)=Y(1)=0\)

For an \(X=1, Y=1\) case, PC is 1 in the former case and 0.5 in the latter, so the bounds are \([0.5, 1]\).
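These bounds can be computed mechanically. A minimal sketch in base R, using the LB/UB formulas derived in the formal section below; the helper name `pc_bounds` is ours, and it assumes no confounding and \(\tau > 0\):

```r
# PC bounds for an X = 1, Y = 1 case, given a 2x2 transition matrix
# with rows X = 0, 1 and columns Y = 0, 1.
pc_bounds <- function(P) {
  tau <- P[2, 2] - P[1, 2]  # average effect: Pr(Y=1|X<-1) - Pr(Y=1|X<-0)
  rho <- P[2, 2] - P[1, 1]  # Pr(Y=1|X<-1) - Pr(Y=0|X<-0)
  c(lower = 2 * tau / (1 + tau + rho),
    upper = (1 + tau - abs(rho)) / (1 + tau + rho))
}

P <- matrix(c(2/3, 1/3,
              1/3, 2/3), 2, 2, byrow = TRUE)
pc_bounds(P)  # lower = 0.5, upper = 1
```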
Sometimes the observed distribution allows for tighter bounds:
| | Y = 0 | Y = 1 |
|---|---|---|
| \(X=0\) | 1 | 0 |
| \(X=1\) | 0.5 | 0.5 |
Here \(X=1\) is necessary for \(Y=1\). From this we know that if \(X=1, Y=1\), then \(X=1\) caused \(Y=1\).
But for an \(X=0, Y=0\) case we don't know whether \(X=0\) caused \(Y=0\).
Other distributions also allow tighter bounds:
| | Y = 0 | Y = 1 |
|---|---|---|
| \(X=0\) | 0.5 | 0.5 |
| \(X=1\) | 0 | 1 |
Here \(X=1\) is sufficient for \(Y=1\). From this we know that if \(X=0, Y=0\), then \(X=0\) caused \(Y=0\).
But for an \(X=1, Y=1\) case we don't know whether \(X=1\) caused \(Y=1\). In fact PC \(= 0.5\).
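Both tables are degenerate, so PC is point identified for an \(X=1, Y=1\) case; reusing the `pc_bounds` sketch from above:

```r
P_necessary  <- matrix(c(1,   0,
                         0.5, 0.5), 2, 2, byrow = TRUE)
P_sufficient <- matrix(c(0.5, 0.5,
                         0,   1), 2, 2, byrow = TRUE)
pc_bounds(P_necessary)   # lower = upper = 1:   X = 1 surely caused Y = 1
pc_bounds(P_sufficient)  # lower = upper = 0.5: a coin flip
```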
Say now that we could decompose the \(X\rightarrow Y\) process into a two-step process: \(X\rightarrow M \rightarrow Y\).
For an \(X=1, Y=1\) case, is learning about \(M\) informative for the probability that \(X=1\) caused \(Y=1\)?
Imagine two possible decompositions. We learn nothing from \(M\) in the first case but might learn a lot in the second case.
Take the second case. We decompose this:
| | \(Y = 0\) | \(Y = 1\) |
|---|---|---|
| \(X=0\) | 0.5 | 0.5 |
| \(X=1\) | 0.25 | 0.75 |
which has PC bounds of \(\left[\frac13, \frac23\right]\), into:
| | \(M = 0\) | \(M = 1\) |
|---|---|---|
| \(X=0\) | 1 | 0 |
| \(X=1\) | 0.5 | 0.5 |
and:

| | \(Y = 0\) | \(Y = 1\) |
|---|---|---|
| \(M=0\) | 0.5 | 0.5 |
| \(M=1\) | 0 | 1 |
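A quick check that the two step matrices indeed compose to the overall matrix, and of what observing \(M=1\) buys for an \(X=1, Y=1\) case; because each step's PC is point identified and the steps are unconfounded, the step-level PCs multiply:

```r
P_XM <- matrix(c(1,   0,
                 0.5, 0.5), 2, 2, byrow = TRUE)  # X = 1 necessary for M = 1
P_MY <- matrix(c(0.5, 0.5,
                 0,   1), 2, 2, byrow = TRUE)    # M = 1 sufficient for Y = 1
P_XM %*% P_MY                       # rows (0.5, 0.5), (0.25, 0.75): as above
pc_bounds(P_XM %*% P_MY)            # without M: [1/3, 2/3]
pc_bounds(P_XM) * pc_bounds(P_MY)   # with M = 1: both steps identified, PC = 0.5
```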
Say now that we could decompose the \(X\rightarrow Y\) process into a 10-step process, with an effect of 0.9 at every step:
| | \(M_{j+1} = 0\) | \(M_{j+1} = 1\) |
|---|---|---|
| \(M_j=0\) | 0.95 | 0.05 |
| \(M_j=1\) | 0.05 | 0.95 |
Then the upper bound at each step remains at 1. The lower bound at each step is \(\frac{0.9}{0.95}\), which is about 95%.
However \(\left(\frac{0.9}{0.95}\right)^{10} \approx 0.58\), so observing the whole chain gives bounds of roughly \([0.58, 1]\): better, but not much better, than the approximately \([0.52, 1]\) implied by the \(X, Y\) data alone.
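The same calculation in base R, a sketch that assumes the per-step lower bounds multiply when every mediator along the chain is observed equal to 1:

```r
# Bounds for the 10-step homogeneous chain with tau_i = 0.9 at each step.
P_step <- matrix(c(0.95, 0.05,
                   0.05, 0.95), 2, 2, byrow = TRUE)

# Without observing mediators: bounds from the implied overall X -> Y matrix.
P_overall <- Reduce(`%*%`, replicate(10, P_step, simplify = FALSE))
pc_bounds(P_overall)             # roughly [0.52, 1]

# Observing every mediator equal to 1: per-step lower bounds multiply.
pc_bounds(P_step)                # per step: [0.9/0.95, 1]
pc_bounds(P_step)[["lower"]]^10  # about 0.58
```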
Knowing a lot about many steps gives greater certainty at each step, but there are more sites for leakage, and so confidence does not accumulate much.
Suppose all step matrices are equal:
[Figure: PC bounds (red for observed, blue for unobserved mediators) tighten only modestly as the length of the homogeneous chain increases.]
We find the largest and smallest upper and lower bounds from any complete mediation process, for different types of evidence.
The general idea, case-level process tracing following data-based model training, can be generalized; see Humphreys and Jacobs, *Integrated Inferences*.
[Figure: Stan structure used for estimation.]
In qualitative inference, a “hoop” test is a search for a clue that, if absent, greatly reduces confidence in a theory.
Define a model with \(X\) causing \(Y\) through \(M\) but with confounding.
We imagine a real world in which there are in fact monotonic effects and no confounding, though this is not known. (The data suggest a process in which \(X\) is necessary for \(M\) and \(M\) is sufficient for \(Y\).)
Define the model, then update, and query.
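A minimal sketch of this workflow using the CausalQueries package (which compiles models to Stan). The argument names follow the package interface as we recall it and may differ across versions; the simulated `df` merely stands in for data from the “real world” described above:

```r
library(CausalQueries)

# X affects Y through M, allowing X-Y confounding
model <- make_model("X -> M -> Y; X <-> Y")

# Stand-in data; in the application this is the observed X, M, Y data
df <- make_data(model, n = 2000)

# Bayesian updating, then the case-level query given different observations
model <- update_model(model, data = df)
query_model(
  model,
  queries = "Y[X=1] > Y[X=0]",
  given   = c("X==1 & Y==1", "X==1 & Y==1 & M==1", "X==1 & Y==1 & M==0"),
  using   = "posteriors"
)
```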
| Given | Truth | Prior mean | Posterior mean | Posterior SD |
|---|---|---|---|---|
| X==1 & Y==1 | 0.62 | 0.268 | 0.313 | 0.183 |
| X==1 & Y==1 & M==1 | 0.70 | 0.250 | 0.354 | 0.206 |
| X==1 & Y==1 & M==0 | 0.00 | 0.250 | 0.005 | 0.006 |
We see that \(M\) is informative about whether \(X\) caused \(Y\) in a case specifically when we observe \(M=0\).
This is striking because nothing in the model specification gave \(M\) probative value; its informativeness was learned from the data.
We can proceed similarly for other Van Evera tests, specifically using moderators to generate “doubly decisive” tests.
Assume a world, like the one above, where in fact \(X \rightarrow M \rightarrow Y\), with all effects strong (80%, 80%).
| Query | Given | Using | Mean | SD |
|---|---|---|---|---|
| Q 1 | - | posteriors | 0.4 | 0.09 |
| Q 1 | M==0 | posteriors | 0.4 | 0.12 |
| Q 1 | M==1 | posteriors | 0.4 | 0.13 |
This negative result holds even if we can exclude a direct \(X \rightarrow Y\) path.
This example illustrates Cartwright's idea of “no causes in, no causes out.”
Institutions and Growth Model
Case-level inferences given possible observations of distance and mortality:

- For a case with weak institutions and low growth (first column), the former likely caused the latter. Similarly for cases with strong institutions and growth (last column).
- For cases with weak institutions and high growth (and vice versa), the relationship is unlikely to be causal.
- In a strong institutions / high growth case, proximity to the equator increases confidence that the strong institutions helped, despite the fact that distance and institutions are complements for the average treatment effect.
Mortality is informative about the effect of institutions on growth even if we already know the values of the other variables in the case.
Learned patterns of confounding are consistent with a world in which settlers responded to low mortality by building strong institutions specifically in those places where they rationally expected strong institutions to help.
Correlated posteriors
Complexity grows quickly
We consider a binary treatment variable \(X\) and binary outcome variable \(Y\). We have experimental data supplying values for \(\Pr(Y=y \mid X\leftarrow x),\) where \(X\leftarrow x\) denotes a regime in which \(X\) is set to value \(x\) by external intervention. Define \[\begin{eqnarray*} \tau &:=& \Pr(Y=1\mid X\leftarrow 1) - \Pr(Y=1\mid X\leftarrow 0)\\ \rho &:=& \Pr(Y=1\mid X\leftarrow 1) - \Pr(Y=0\mid X\leftarrow 0). \end{eqnarray*}\] Then \(\tau\) is the average causal effect of \(X\) on \(Y\), while \(\rho\) is a measure of how common the outcome is.
The transition matrix from \(X\) to \(Y\) is then \[\begin{equation} \label{eq:P} P = P(\tau,\rho) :=\left( \begin{array}[c]{cc} \frac 1 2(1+\tau-\rho) & \frac 1 2(1-\tau+\rho)\\ \frac 1 2(1-\tau-\rho) & \frac 1 2(1+\tau+\rho) \end{array} \right) \end{equation}\] where we have \[\begin{equation} \label{eq:rhotau} |\rho| + |\tau| \leq 1. \end{equation}\]
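To see where \(P(\tau,\rho)\) comes from, add and subtract the definitions of \(\tau\) and \(\rho\): \[\begin{eqnarray*} \tau + \rho &=& 2\Pr(Y=1\mid X\leftarrow 1) - 1\\ \rho - \tau &=& 2\Pr(Y=1\mid X\leftarrow 0) - 1, \end{eqnarray*}\] so that \(\Pr(Y=1\mid X\leftarrow 1) = \frac12(1+\tau+\rho)\) and \(\Pr(Y=1\mid X\leftarrow 0) = \frac12(1-\tau+\rho)\): exactly the second column of \(P\). The constraint \(|\rho| + |\tau| \leq 1\) is just the requirement that all four entries of \(P\) be non-negative.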
We have equality if and only if one of the entries of \(P\) is 0, in which case we term \(P\) degenerate.
We have observed treatment \(X=1\) and outcome \(Y=1\), and want to know the probability that the treatment caused the response. We introduce the potential responses, \({\mbox{$\mathbf Y$}}= (Y_0,Y_1)\), with \(Y_x\) the potential value of \(Y\) when setting \(X \leftarrow x\). We will assume no confounding (\(X \perp\!\!\!\perp {\mbox{$\mathbf Y$}}\)).
The probability of causation is defined as:
\[\text{PC}: = \Pr(Y_0 =0 \mid Y_1 =1, X=1)\]
The joint distribution for \({\mbox{$\mathbf Y$}}\) has the form of the following table:

| | \(Y_1 = 0\) | \(Y_1 = 1\) |
|---|---|---|
| \(Y_0=0\) | \(\frac12(1-\rho-\xi)\) | \(\frac12(\xi+\tau)\) |
| \(Y_0=1\) | \(\frac12(\xi-\tau)\) | \(\frac12(1+\rho-\xi)\) |

The entries of the table are not determined by \(P\), but have one degree of freedom, expressed by the “slack” quantity \(\xi = \xi(P)\).
We only know that \(|\tau| \leq \xi \leq 1-|\rho|\) (the range over which all four entries are non-negative). In particular, if \(\tau>0\), \(X=1\), \(Y=1\), we have
\[\begin{equation*} \text{PC} = \frac{\xi+\tau}{2\Pr(Y=1 \mid X \leftarrow 1)}, \end{equation*}\]
with the following interval bounds:
\[\begin{equation*} \text{LB}:=\frac{2\tau}{1+\tau+\rho} \leq \text{PC} \leq \frac{1+\tau-|\rho|}{1+\tau+\rho} =: \text{UB}. \end{equation*}\]
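These bounds follow directly from the constraint on \(\xi\): PC is increasing in \(\xi\), so we evaluate it at the endpoints \(\xi = \tau\) (given \(\tau > 0\)) and \(\xi = 1-|\rho|\), using \(2\Pr(Y=1 \mid X \leftarrow 1) = 1+\tau+\rho\): \[\begin{eqnarray*} \xi = \tau &\Rightarrow& \text{PC} = \frac{2\tau}{1+\tau+\rho} = \text{LB}\\ \xi = 1-|\rho| &\Rightarrow& \text{PC} = \frac{1+\tau-|\rho|}{1+\tau+\rho} = \text{UB}. \end{eqnarray*}\]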
\(\text{PC}\) is identified if and only if \(|\rho| = 1-\tau\), which holds when \(P\) is degenerate with either the lower left or upper right element of \(P\) being 0. In the former case \(\text{PC}=\tau\), while in the latter case \(\text{PC} = 1\). More generally, we have \(\text{LB} = \tau/\Pr(Y=1 \mid X \leftarrow 1) \geq \tau\), so \(\text{PC} \geq \tau\).
We can sometimes improve these bounds if we can observe additional variables.
Here we consider a chain of binary mediators that form a complete mediation sequence \[X \equiv M_0 \rightarrow M_1 \ldots \rightarrow M_n \equiv Y\]
Suppose we are able to determine \(\Pr(M_{i+1} = m_{i+1} \mid M_i \leftarrow m_i)\), and that there is no confounding at any step.
In this case we have a (generally non-stationary) Markov chain.
Let the transition matrix from \(M_{i-1}\) to \(M_{i}\) be \(P_i = P(\tau_i,\rho_i)\), and the overall transition matrix from \(X\) to \(Y\) be \(P = P(\tau,\rho)\).
We have \[P = P^{(n)} := \prod_{i = 1}^n P_{i}, \qquad \tau = \prod_{i=1}^{n} \tau_i, \qquad \rho = \sum_{i=1}^{n} \rho_i \prod_{j=i+1}^n \tau_j.\]
That is: the average causal effect of \(X\) on \(Y\) is the product of the successive average causal effects of each variable in the sequence on the following one.
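A numeric spot-check of these composition formulas for a two-step chain, in base R; the parameter values are chosen arbitrarily:

```r
# Build P(tau, rho) as defined above and compose two steps.
P_of <- function(tau, rho) {
  matrix(c(1 + tau - rho, 1 - tau + rho,
           1 - tau - rho, 1 + tau + rho) / 2, 2, 2, byrow = TRUE)
}
tau1 <- 0.6; rho1 <- 0.2
tau2 <- 0.5; rho2 <- -0.1
P12 <- P_of(tau1, rho1) %*% P_of(tau2, rho2)
c(P12[2, 2] - P12[1, 2], tau1 * tau2)         # tau of product: 0.30 both ways
c(P12[2, 2] - P12[1, 1], rho1 * tau2 + rho2)  # rho of product: 0.00 both ways
```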
We introduce the potential variables \({\mbox{$\mathbf M$}}_i :=(M_{i0}, M_{i1})\), where \(M_{im}\) denotes the potential value of \(M_i\) under \(M_{i-1} \leftarrow m\) (supposed unaffected by values of previous \(M\)’s).
We impose mutual independence between \(X\), \({\mbox{$\mathbf M$}}_1, \ldots, {\mbox{$\mathbf M$}}_n\) (no confounding).